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Abstract 

A distance-based method to reconstruct a phylogenetic tree with n leaves takes a distance 
matrix, n x n symmetric matrix with Os in the diagonal, as its input and reconstructs a tree 
with n leaves using tools in combinatorics. A safety radius is a radius from a tree metric (a 
distance matrix realizing a true tree) within which the input distance matrices must all he in 
order to satisfy a precise combinatorial condition under which the distance-based method is 
guaranteed to return a correct tree. A stochastic safety radius is a safety radius under which 
the distance-based method is guaranteed to return a correct tree within a certain probability. 
In this paper we investigated stochastic safety radii for the neighbor-joining (NJ) method and 
balanced minimal evolution (BME) method for n = 5. 


1 Introduction 

A phylogenetic tree (or phylogeny ) on the set X = [n\ is a graph which summarizes the relations of 
evolutionary descent between different species, organisms, or genes. Phylogenetic trees are useful 
tools for organizing many types of biological information, and for reasoning about events which 
may have occurred in the evolutionary history of an organism. There has been much research 
on phylogenetic tree reconstructions from alignments, and distance-based, methods are some of the 
best-known phylogenetic tree reconstruction methods. 

Once we compute pairwise distances V(a;,y) € X x X from an alignment, we can reconstruct a 
phylogenetic tree via distance-based methods. In contrast with parsimony methods, distance-based 
methods have been shown to be statistically consistent in all settings (such as the long branch 
attraction) [H 0 01 [7] . Distance-based methods also have a huge speed advantage over parsimony 
and likelihood methods in terms of computational time, and hence enable the reconstruction of 
trees with large numbers of taxa. However, a distance-based method is not a perfect method to 
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reconstruct a phylogenetic tree from the input sequence data set: in the process of computing a 
pairwise distance, we ignore interior nodes of a tree as well as a tree topology, and thus we lose 
information from the input sequence data sets. Therefore it is important to understand how a 
distance based method works and how robust it is with noisy data sets. 

One way to measure its robustness is called the safety radius. A safety radius is a radius from a 
tree metric (a distance matrix realizing a true tree) within which the input distance matrices must 
all lie in order to satisfy a precise combinatorial condition under which the distance-based method 
is guaranteed to return a correct tree. More precisely, we have the following definition. 

Definition 1. Suppose we have a vector representation of all pairwise distances 6 £ mW and 
suppose dT tW := (d xy ) x , y ex > where T £ r n , r n is the set of all phylogenetic unrooted trees with 
leaves X = [n], and w £ Ml" -3 , where R + is the set of all non-negative real numbers, is a vector 
representation of the set of branch lengths in T, is a tree metric, i.e., d xy > 0 is the total of branch 
lengths in the unique path from a leaf x to a leaf y in T. Let w m in be the smallest interior branch 
length in T. Then a method M for reconstructing a phylogenetic X-tree from each distance matrix 
S on X is said to have a 1^ safety radius p n if for any binary phylogenetic tree T with n leaves we 
have: 

11 8 dx,w | |oo ^ Pn ' W m i n =r> M ($) = T. 

Notice that the definition of the safety radius defined in Definition[l]is deterministic even though 
the input data <5 is a multivariate random variable. Thus, this is more meaningful to define in terms 
of probability distribution. Thus, in 2014 Steel and Gascuel introduced a notion of stochastic safety 
radius [5j. 

c 2 

Definition 2 (Stochastic safety radius). Suppose we allow a 2 to depend on n: a 2 = -—, , , for 

log (n) 

some value c / 0. For any r/ > 0, we say that a distance-based tree reconstruction method M 
has ij-stochastic safety radius s = s n if for every binary phylogenetic X-tree T on n leaves, with 
minimum interior edge length w m i n; and with the distance matrix 8 on X described by the random 
errors model, we have 

c < s ■ u; m in => P(M(8) = T) > 1 — rj. 


In this paper we focus on two distance-based methods, namely neighbor-joining (NJ) method 
and balanced minimal evolution (BME) method. In 2002, Desper and Gascuel introduced a BME 
principle, based on a branch length estimation scheme of Pauplin [13]. The guiding principle of 
minimum evolution tree reconstruction methods is to return a tree whose total length (sum of 
branch lengths) is minimal, given an input dissimilarity map. The BME method is a special case 
of these distance-based methods wherein branch lengths are estimated by a weighted least-squares 
method (in terms of the input 8 and the tree T £ r„ in question) that puts more emphasis on shorter 
distances than longer ones. Each labeled tree topology gives rise to a vector, called herein the BME 
vector, which is obtained from Pauplin’s formula. In 2000, Pauplin showed that the BME method 
is equivalent to optimizing a linear function, the dissimilarity map, over the BME representations 
of binary trees, given by the BME vectors m ■ Eickmeyer et. al. defined the n th BME polytope 
as the convex hull of the BME vectors for all binary trees on a fixed number n of taxa. Hence the 
BME method is equivalent to optimizing a linear function, namely, the input distance matrix 8, 
over a BME polytope. They characterized the behavior of the BME phylogenetics on such data 
sets using the BME polytopes and the BME cones, i.e., the normal cones of the BME polytope. 



The study of related geometric structures, the BME cones, further clarifies the nature of the 
link between phylogenetic tree reconstruction using the BME criterion and using the NJ Algorithm 
developed by Saitou and Nei [14]. In 2006, Gascuel and Steel showed that the NJ Algorithm, one of 
the most popular phylogenetic tree reconstruction algorithms, is a greedy algorithm for finding the 
BME tree associated to a distance matrix 5 The NJ Algorithm relies on a particular criterion 
for iteratively selecting cherries; details on cherry-picking and the NJ Algorithm are recalled later 
in the paper. In 2008, based on the fact that the selection criterion for cherry-picking is linear in the 
distance matrix S [2], Eickmeyer et. al. showed that the NJ Algorithm will pick cherries to merge 
in a particular order and output a particular tree topology T if and only if the pairwise distances 
satisfy a system of linear inequalities, whose solution set forms a polyhedral cone in r( 2) [5]. They 
defined such a cone as an NJ cone. In general, the sequence of cherries chosen by the NJ Algorithm 
is not unique, hence multiple distance matrix S will be assigned by the NJ Algorithm to a single 
fixed tree topology T. The set of all distance matrix 6 for which the NJ Algorithm returns a fixed 
tree topology T is a union of NJ cones, however this union is not convex in general. Eickmeyer et. 
al. characterized those dissimilarity maps for which the NJ Algorithm returns the BME tree, by 
comparing the NJ cones with the BME cones, for eight or fewer taxa [5] . 

In this paper we use the BME cones and NJ cones in order to investigate their stochastic safety 
radius for n = 5. Here we assume that the multivariate random variable S is defined as follows: 

$xy dxy T e X y , 

where e xy ~ _/V(0,<r 2 ), the Gaussian distribution with mean 0 and a standard deviation cr > 0, are 
independent for all pairwise distance (x,y) £ X x X. This paper is organized as follows: Section 
[2] shows the probability distribution of a random 5 so that it satisfies the four point rule for all 
distinct quartets in [n] with a fixed T. Zarestkii in m defined the notion of the four point rule as 
follows: we select the tree topology xy\wz (which means there is an internal edge between {x,y} 
and {re, 2 } for a distinct x,y,w,z £ [n]) if 

&xy T &wz < mln-( S X w + $yzi $xz ~\~ &yw }■ (i) 

In Section [ 3 ] we will show multivariate probability distribution P(M(5) = T) where T £ t 5 is 
fixed and M is the BME method, in Section [4] shows the multivariate probability distribution 
P(M(S) = T ) where T £ T5 is fixed and M is the NJ method. Finally in Section [5] we will show 
some computational results on these probability distributions and we have shown the plot for the 
stochastic safety radii for the NJ and the BME methods varying ?y and c for n = 5 (Figure [6]). As 
shown in Figure [6] both stochastic safety radii are basically almost identical in this case since the 
probability distributions P(M(S) = T) for the NJ and for the BME methods are almost identically 
same shown in Figure [5] for n = 5 and w m - ln = 1. 

2 Probability distribution on “four point rule” 

For a tree containing random errors, the pairwise distance between two leaves is 

d xy — d xy T e X y 

where x and y are different taxas of a tree, d xy is the true pairwise distance between taxa x and 
y , and e' xy s follow i.i.d. Gaussian Distribution with mean 0 and variance a 2 . Intuitively in this 


section we are computing a probability distribution such that if we select a random $ £ Rw, $ 
satisfies Equation [I] if and only if there is an internal edge between {x, y} and {«;, zj in T £ t „, for 
all distinct {a;, y , w, z} £ [n]. We hnd a formula for the probability, for 5 taxa, that a tree metric 
with random errors still obeys the original four-point inequalities on each subset of four leaves. 

We first consider four point rule on 4 taxa 
tree. Suppose Figure [T] is the true tree. Then -| 

for a random tree, the following inequalities 
must be satisfied in order to return the cor¬ 
rect tree: 

$12 + $34 < $13 + $24 
$12 + $34 < $14 + $23 



Since 


Figure 1: 4 taxa tree 


$12 = ei + e2 + ei2 

$34 = 63 + e 4 + £34 

$13 = ei + e 3 + w + £13 

$24 = e 2 + e^ + W + £24 

$14 = ei + e4 + W + £14 

$23 = e 2 + e 3 + W + £23 


(3) 


Then we can have 


£12 + £34 < 2 W + £13 + £24 
£12 + £34 < 2 W + £14 + £23 


(4) 


Since e xy N( 0, a 2 ), we know ei 2 + £ 34 , £13 + £ 24 , £14 + £23 AT(0, 2cr 2 ). Let / and F be the 
density and cumulative distribution functions of N(0, 1), respectively. Then the probability that 
Inequality [4] is satisfied, i.e. the probability that a random distance matrix $ returns the true tree, 
equals to: 


J°°J( X )[1-F(-^)\ 2dx (5) 

Now we consider four point rule on 5 taxa tree. Suppose the true tree is Figure [2(a)] We need 
to check the rule on all possible combinations of four distinct leaves in this tree. It is trivial to see 
we only have 5 different combinations. For each of them, we could construct two inequalities similar 
to the way we obtained Equation [4j Therefore, we have 10 inequalities for the 5 combinations of 4 
distinct taxa: 
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then the 10 inequalities are: 

Ue < (2wi, 2wi,2wi, 2u>i,2wi + 2w2 , 2w\ + 2u>2 , 2w2 ,2^2, 2w2, 2w2) T ■ (7) 

Thus the probability that a 5-leaved tree metric with random errors still obeys the original-four 
point inequalities on each subset of four leaves is the probability that inequality (J7| is satisfied. 


3 Probability distribution of the output tree via the BME 
method 

This method begins with a given set of n items and a symmetric (or upper triangular) square n x n 
distance matrix whose entries are numerical dissimilarities, or distances, between pairs of items. 
From the distance matrix the BME method constructs a binary tree with the n items labeling the 
n leaves. The BME tree has the property that the distances between its leaves most closely match 
the given distances between corresponding pairs of taxa. 

By “most closely match” in the previous paragraph we mean the following: the reciprocals of 
the distances between leaves are the components of a vector c, and this vector minimizes the dot 
product c • S where S is the list of distances in the upper triangle of the distance matrix. 

More precisely: Let the set of n distinct species, or taxa, be called X. For convenience we will 
often let X = [n] = {1,2,..., n}. Let vector 6 be given, having (?{) real valued components S xy , one 
for each pair {x,y} C X. There is a vector c (t) for each binary tree t. on leaves X , also having (™) 
components c xy {t ), one for each pair {x, y} C X. These components are ordered in the same way for 
both vectors, and we will use the lexicographic ordering: 5 = (< 5 i 2 , £> 13 ,..., S i„, ^ 23 , c> 24 > • • •, <5n,-i,n)- 
We define, following Pauplin |iT3l : 


c xy(t) - 2l(x,y) 

where l(x,y) is the number of internal nodes (degree 3 vertices) in the path from leaf x to leaf 
y. The BME tree for the vector S is the binary tree t that minimizes 6 ■ c(t) for all binary trees 
on leaves X. Rather than the original fractional coordinates c xy we will scale by a factor of 2™ -2 , 
giving coordinates 

Y _ 0"-2 r _ nn —2 — l{x,y) 

-*-xy ^ y -'xy " 





Since the furthest apart any two leaves may be is a distance of n — 2 internal nodes, this scaling 
will result in integral coordinates. Thus we can equivalently say that the BME tree for the vector 
S is the binary tree t that minimizes 5 ■ x(f) for all binary trees on leaves X. 

Consider a tree metric dx which arises from a binary tree T with five leaves {a, b, c, d, e}. Let 
the interior edges e\ and e 2 have lengths Wi = l(ei). 

Theorem 3. Let T, the tree for which dx is a tree metric, have cherries {a, b} and {c, d}. 

Let: 

U 1 = 2e ac ~\~ C a( i €b c -f- Cbd + 3€be € ce 
V 2 = 2e ac + e ae — ebc + 3 Cbd + e&e — Ccd 
V3 = — £ac + Cad + 3e ae + 2ebc + Cbd, ~ lCce 
2/4 = — c ac + 3 e a d + c ae + 2ef, c + Cb e ~ c c d 

Let: 

Zl = — £qc + 3 Cad + Cbd + Cbe ~ C c d + 2e ce 
Z2 = — C ac + 3e ae + Cbd + Cbe + 2e c d — Cce 
Z 3 = Cad + Cae — Cbe + 3 Cbd ~ C c d + 2e ce 
Z4 = C a d + C ae ~~ Cbc + 3 Cbe + 2 t c d ~ C c e 

Then the BME method will return the correct tree T if and only if: 

4w2 > e ac + Cbc + 2 Cde — min (e a d + £ bd + 2 e ce , e ae + Cb e + 2e c d) 

6wi + 4 w 2 > 3e a b + 2e de - min(j/i, y 2 , 2 / 3 , 2 / 4 ) 

6 uq + 6 W 2 > 3e a f, + 3e de - min(3e oe + 3e M , 3e a(i + 3e fae ) 

4wq + 6 w 2 > 2e a b + 3e de ~ nrin(z 1 , z 2 , z 3 , z±) 

Aw\ > 2e a b + c c d + e ce — min(e a( 2 + e ae + 2eb c , 2e ac + Cbd + Cbe) 

Proof. The BME method will return the correct tree T if and only if 

(dr + e) • x(T) < (dr + e) • x(f) 

for all alternate trees t. This is true since the 1-skeleton of the BME polytope for n = 5 is the 
complete graph on the 15 vertices. 

Further, the above inequality holds iff 

d T ■ (x(i) - x(T)) > e • (x(T) - x(f)) 

for all alternate trees t. Note that all the trees with five leaves have the same topology. 

There are 14 other possible trees t. These separate into 5 sets of trees for which the left hand 
side of the above inequality is respectively 4w 2 , 6uq + 4w 2 , Qw\ + 6u> 2 ,4 w\ + 6u> 2 , or 4 w\. For each 
of these we collect the right hand sides, and take their maximum. □ 

4 Probability distribution of the output tree via the NJ 
method 

4.1 H-representation of NJ cones pj 

Recall that the tree metric dx, w = {d xy )i< x ,y<n is a symmetric matrix with d xx = 0. We can flatten 
the entries in the upper triangle (diagonal entries are omitted) by columns: 

d X y — d (y-l)(y-2) | 



where 1 < x < n — 1 and x + 1 < y < n. Let d = (d\, d 2 , • ■ •, d m ) , to := ( 2 ), be the vector of 
tree metric after flattening. Notice here this flattening defines a one-to-one mapping between the 
indices: 

If : {(x, y) G Z : 1 < x < n — 1 , x + 1 < y < n} —> { 1 , 2 ,..., to }, If(x, y) = —-^-- + x. 

In NJ algorithm, we first compute the Q-criterion (cherry picking criterion): 

n n 

Qxy — 2 )d X y ^ ^ ' d Z y. 

2 = 1 2=1 

Similar as the flattened tree metric d, the Q-criterion can also be flattened to a to dimensional 
vector q which can be obtained from d by linear transformation: 

q = ^ n) d, 

where matrix A^ is defined as: 

| n — 4 if i = j 

Ay } = S - 1 if * 7 ^3 an d {x, y} n {z, w} ^ 0 , 

[0 else 

where (x,y) = Ij 1 (i) and (z,w) = IJ 1 (j). 

Now each entry in q corresponds to a pair of nodes in T, the next step of NJ algorithm is to 
find the minimum entry of q and join the corresponding two nodes as a cherry (“cherry picking”), 
then these two nodes will be replaced by a new node and the tree metric will be updated as d' 
(the dimension is reduced). We can see NJ algorithm proceeds by picking one cherry and reducing 
the size of the tree metric in each iteration until a binary tree is reconstructed. Without loss of 
generality and for the convenience of expression, we will only give details for the first iteration and 
assume the cherry we pick is (n — 1, n) in the rest part of this section. 

First, to make cherry (n— 1, n) be the one to be picked, q rn = qi f ( n -i,n) needs to be the minimum 
in q. This means the following inequalities need to be satisfied: 

(Im-l, —Im-i)q > 0 =► H^d > 0, = (J m _ 1; -l m - 1 ) J 4("). 

Note that if an arbitrary cherry is picked, then a permutation of columns need to be assigned to 

Then, after picking cherry (n — 1, n), we join these two nodes as the new node (n — 1)*. Again, 
we can produce the new reduced tree metric from d by linear transformation: 

d' = fid, 

where fi = (nj) G i( m -" +1 ) xm , 

'l if 1 < < = i < (" 2 2 ) 

1/2 if (V) + l<i<(V)- j 

^=1/2 if(V)+l<i<(V). J 

|-l/2 if("" 2 )+l<i<(V). J 

else 


= i 

=i+n— 2 . 

= TO. 


0 



At last, after including inequalities in all iterations, by the shifting lemma in [6], we also include 
the following equalities: V node x, 


,4 0, (s ^)i 


if X G I f 1 (z) 
else 


4.2 H-representation of 5 taxa NJ cones 

There is only one tree topology for 5 taxa tree. Therefore, without loss of generality, we assume 
our true tree is Figure [2(a)] 



Figure 2: The 5 taxa tree used to generate data, all edges has length 1 


For 5 taxa tree, the flattening for tree metric is: 

xy: 12 13 23 14 24 34 15 25 35 45 

d = (di d 2 d 3 d± d 5 d e dj d 8 dg dio) 


In Section 


4.1 


we can see that the permutation of columns for H^ n > and R depends on the cherry 
we pick. This means that we should compute NJ cones for all ordering of cherry picking (see the 
two orderings of cherry picking in Figure [2(b)| for example). 


There are four orderings of cherry picking. First we can pick cherry (1, 2) then pick cherry (4, 5), 
which we denote as (1,2) — (4, 5). Use a similar notation we have the other three: (1, 2) — ((1, 2), 3), 
(4, 5) — (1, 2), and (4, 5) — (3, (4, 5)). Take the ordering (4, 5) — (3, (4,5)) for example, use the results 
























in Section 4.1 we can obtain the following linear constraints on d: 
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Although it is not obvious to see, we found that the linear constraints for orderings (1,2) — (4,5) 
and (1,2) — ((1,2),3) are exactly the same, and the linear constraints for orderings (4,5) — (1,2) 
and (4,5) — (3, (4,5)) are exactly the same. Therefore we only consider two NJ cones: the one for 
(1,2) — (4,5) (denote as C( li2 )-( 4 , 5 )), and the one for (4,5) — (1,2) (denote as C( 4i5 )_( 1;2 )). 


4.3 


Computing the probability that NJ reconstructs the correct 5 taxa 
tree 


For the 5 taxa tree given in Figure 2(a) under the random errors model, we know that the flattened S 


should follow a multi-variate normal (MVN) distribution with mean fj, = (2, 3, 3,4,4,3,4,4,3, 2) and 
covariance matrix S = <j 2 I\q. To trace the performance of NJ algorithm under different variation, 
we let er 2 to be a set of values in (0,1] and then compute the probability that NJ algorithm 
reconstructs the correct tree for each value of cr 2 . 

For a given a 2 , it is trivial to see that the probability that NJ algorithm returns the right tree 
is Pr(5 £ C(i > 2)-(4,5)) + Pr(S £ C( 4j 5 )_(i i 2 )) — Pr(6 £ C(i i2 )_( 4> 5 ) D C( 4i 5 )_(i >2 )). 

We used software Polymake [TO] and verified that the dimension of C( 1)2 )_( 4 , 5 ) D C( 4) 5 )_( 1>2 ) 
is lower than both of them, therefore Pr(5 £ C( 12 )-( 4 , 5 ) FI C( 4j5 )_ (1i2 )) = 0. Polymake also gives 
us the facets of these two NJ cones. For example, the facets of C( 4>5 )_( 1)2 ) are: 
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With these facets, we can use the R function “pmvnorm{mvtnorm}” with GenzBretz algorithm to 
compute Pr{5 £ C( li2 )_( 4 j5 )) and Pr(5 £ C( 4i5 )_( li2 )). 


5 Computational experiments 


In this section, we show both the theoretical 
reconstructs the correct tree, as well as NJ 
experiments, we set all branch lengths to be 
of a 2 . 

In Figure [ 3 J when a 2 is increasing, the 
probability of 5 taxa tree will dramatically 
decrease faster than the probability of 4 taxa 
tree because we have more constraints to sat¬ 
isfy in 5 taxa tree which leads to lower prob¬ 
abilities. 


and simulation probabilities that the four point rule 
algorithm and BME method. In our computational 
l’s and compute the probabilities for different values 


In Figure 4(a) we calculated the theoret¬ 


ical probability that the NJ method recon¬ 
structs the correct 5 taxa tree based on Sec¬ 
tion [4j For the simulation, we fix the 5 taxa 
tree in Figure |2(a)| with all branch lengths to 
be Vs, and add i.i.d. normal random errors 
e' xy s to the pairwise distance matrix. Then 
we use R function “nj{ape}” from the “ape” 
package in R |12| to reconstruct the tree. If 
the RF distance equals 0, it means that NJ 
method successfully returns the correct tree. 
Our simulation is based on 10,000 random 
trees. Figure [4(a) shows that the theoretical 
probabilities perfectly match the simulation 
result. 



Figure 3: Theoretical probability that four point rule 
will return the correct 4 taxa tree, and that 5 taxa 
tree metric with random errors still obeys the original 
four-point inequalities on each subset of four leaves. 




(a) Theoretical Probability and Simula- (b) Theoretical Probability and Simulation for 

tion for NJ on 5 Taxa Tree BME on 5 Taxa Tree 


Figure 4: Probability distributions for five leaves 
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Figure 5: Comparison between BME and NJ on 5 Taxa Tree 


In Figure |4(b)[ we calculated the theo¬ 
retical and simulated probabilities that the 
BME method will return the correct 5 taxa 
tree. For the theoretical probability, we gen¬ 
erate 100,000 sets of random errors, and 
check whether the theoretical rule is satis¬ 
fied. In the end, we return the percent¬ 
age. For the simulation, we generate ran¬ 
dom trees in the similar way to what we did 
for NJ algorithm. Then we used R function 
“fastme.bal{ape}” to reconstruct the tree. 
Again we used RF distance to check if the 
BME method successfully returned the cor¬ 
rect tree. Our simulation is based on 10,000 
random trees. Figure |4(b)| shows that the 
theoretical probabilities perfectly match the 
simulation result. 

Figure [5] shows that there is almost no 
difference between BME and NJ methods on 
5 taxa tree in both theoretical probabilities 
and simulation result. 


Stochastic Safety Radius 



Figure 6: Stochastic safety radii for the NJ and BME 
methods for n = 5. The x-axis represents rj and the 
y-axis represents the upper bound for c/w m i n with 
Wmin = 1 for this experiment. 


Figure [6] shows the stochastic safety radii for the NJ and the BME methods for n = 5 and 
w m in = 1. As shown in Figure [6] both stochastic safety radii are basically almost identical in this 
case since the probability distributions P(M(S) = T ) for the NJ and for the BME methods are 
almost identically same shown in Figure [5] 
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